ABSTRACT

We investigate the quality of associations of astronomical sources from multi-wavelength observa- 
tions using simulated detections that are realistic in terms of their astrometric accuracy, small-scale 
clustering properties and selection functions. We present a general method to build such mock cat- 
alogs for studying associations, and compare the statistics of cross-identifications based on angular 
separation and Bayesian probability criteria. In particular, we focus on the highly relevant problem 
of cross-correlating the ultraviolet GALEX and optical SDSS surveys. Using refined simulations of 
the relevant catalogs, we find that the probability thresholds yield lower contamination of false asso- 
ciations, and are more efficient than angular separation. Our study presents a set of recommended 
criteria to construct reliable crossmatch catalogs between SDSS and GALEX with minimal artifacts. 
Subject headings: astrometry - catalogs - methods: statistical 
1. INTRODUCTION

Astrophysical studies can gain significantly by associating data
from different wavelength ranges of the electromagnetic spectrum.
Dedicated multi-wavelength surveys have been a strong focus of
observational astronomy in recent years, e.g. AEGIS (Davis et al.
2007), COSMOS (Scoville et al. 2007), or GOODS (Dickinson et al.
2003). At redshifts lower than those probed by these surveys,
several surveys of NASA's Galaxy Evolution Explorer (GALEX; Martin
et al. 2005) essentially provide the perfect ultraviolet
counterparts of the Sloan Digital Sky Survey (SDSS; York et al.
2000) optical data sets. Combining these datasets provides
invaluable insights into the properties of stars and galaxies.

Naturally, these data are taken by different detectors in separate
projects, hence their information must be combined by associating
the independent detections. Recent work by Budavári & Szalay (2008)
laid down the statistical foundation of the cross-identification
problem. Their probabilistic approach assigns an objective Bayesian
evidence and subsequently a posterior probability to each potential
association, and can even consider physical information, such as
priors on the spectral energy distribution or redshift, in addition
to the positions on the celestial sphere. In this paper, we put the
Bayesian formalism to work, and aim to assess the benefit of using
posterior probabilities over simple angular separation cuts using
mock catalogs of GALEX and SDSS. In Section 2 we present a general
procedure to build mock catalogs that take into account source
confusion and selection functions. Section 3 provides the details of
the cross-identification strategy, and defines the relevant quality
measures of the associations based on angular separation and
posterior probability. In Section 4 we present the results for the
GALEX-SDSS cross-identification, and propose a set of criteria to
build reliable combined catalogs.

2. SIMULATIONS 

The goal is to mimic as closely as possible the process of
observation and the creation of source lists. First, a mock catalog
of artificial objects is generated with known clustering properties,
using the method of Pons-Bordería et al. (1999). We then complement
this by adding observational effects that are not included in this
method. We generate simulated detections as observations of the
artificial objects with given astrometric accuracy and selections.
Hence separate sets of simulated detections, say for GALEX and SDSS,
differ not only in their positions but also in being different
subsets of the mock objects.

2.1. The Mock Catalog 

We built the mock catalog as a combination of clustered sources (for
galaxies) and sources with a random distribution (for stars). To
simulate clustered sources, we generate a realization of a Cox point
process, following the method described by Pons-Bordería et al.
(1999). This point process has a known correlation function which is
similar to that observed for galaxies. We create such a process
within a cone of 1 Gpc depth; assuming the notation of Pons-Bordería
et al. (1999), we used $\lambda_s = 0.1$ and $l = 1\,h^{-1}$ Mpc for
the Cox process parameters. For our purpose, it is sufficient that
the distribution on the sky (i.e., the angular correlation function)
of the mock galaxies displays clustering up to scales equal to the
search radius used for the cross-identification (5" here) and that
this distribution is similar to the actual one. Figure 1 shows the
angular correlation function of our mock galaxy sample (filled
squares) along with the measurement obtained by Connolly et al.
(2002) from SDSS galaxies with 18 < r* < 22. Note that the galaxy
clustering is not well known at small scales ($\theta$ < 10")
because of the combination of seeing, point spread function, etc.
Hence there is no constraint in this regime. There is nevertheless a
good overall agreement between our mock catalog and the observations
at scales between 10" and 30".

In the case of both GALEX and SDSS, galaxies and stars show on
average similar densities over the sky. We create a mock catalog
over 100 sq. deg. with a total of 10^7 sources, half clustered and
half random. The minimum Galactic latitude at which this mock
catalog is representative is around 25°. For this case study we do
not consider the variation of star density with Galactic latitude;
we note that several mock catalogs can be constructed with different
star densities, and prior probabilities (see Section 3) varying
accordingly.

Fig. 1. Angular correlation function of mock galaxies (filled
squares) compared to the angular correlation function of SDSS
galaxies selected with 18 < r* < 22, from Connolly et al. (2002)
(filled circles).
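
The clustered component can be illustrated with a short sketch of a segment Cox process: segments are scattered at random, and points are then scattered at random along them. This is only a sketch under stated assumptions. It works in a periodic-free cube rather than the cone used here, it omits the projection onto the sky, and the names `seg_density`, `pts_per_length`, and `box_size` are hypothetical parameters chosen for the illustration, not values or symbols taken from the text.

```python
import numpy as np

def segment_cox_process(seg_density, pts_per_length, seg_length, box_size, rng):
    """Draw one realization of a segment Cox process in a cube.

    seg_density    : mean number of segments per unit volume (assumed name)
    pts_per_length : mean number of points per unit segment length (assumed name)
    seg_length     : length l of every segment
    box_size       : side of the cube holding the segment centres
    """
    # Number of segments is Poisson distributed around density * volume.
    n_seg = rng.poisson(seg_density * box_size ** 3)
    centres = rng.uniform(0.0, box_size, size=(n_seg, 3))

    # Isotropic orientations: normalise Gaussian triples to unit length.
    dirs = rng.normal(size=(n_seg, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

    # Poisson number of points on each segment, uniform along its length.
    counts = rng.poisson(pts_per_length * seg_length, size=n_seg)
    owner = np.repeat(np.arange(n_seg), counts)
    t = rng.uniform(-0.5 * seg_length, 0.5 * seg_length, size=owner.size)
    points = centres[owner] + t[:, None] * dirs[owner]
    return points, owner, centres

rng = np.random.default_rng(42)
points, owner, centres = segment_cox_process(0.1, 5.0, 1.0, 20.0, rng)
print(points.shape)
```

Adding an equal number of uniformly distributed points would then stand in for the unclustered (stellar) half of the catalog.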

2.2. Simulated Detections 

From our mock catalog we create two sets of simulated detections,
using the approximate astrometric errors of the surveys we consider.
We assume that the errors are Gaussian, and create two detections
for each mock object: a mock SDSS detection with position error
$\sigma_S$, and a mock GALEX one with position error $\sigma_G$. We
consider constant errors for SDSS, and variable errors for GALEX.
For GALEX we focus here on the case of the Medium Imaging Survey
(MIS); we consider two selections: all MIS objects, or MIS objects
with signal-to-noise ratio (S/N) larger than 3. We randomly assign
to the mock sources errors from objects of the GALEX datasets
following the relevant selections and using the position error in
the NUV band (nuv_poserr). The distributions of these errors are
shown in Figure 2. In the case of GALEX, the position errors are
defined as the combination of the Poisson error and the field error,
added in quadrature. The latter is assumed to be constant over the
field (and equal to 0.42" in NUV). For SDSS we assume that
$\sigma_S$ = 0.1" for all objects. Our results are unchanged if we
use variable SDSS errors for our SDSS mock detections, as the SDSS
position errors are significantly smaller than the GALEX ones.
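
As an illustration of this step, the sketch below perturbs mock positions with Gaussian errors. It assumes a flat-sky approximation away from the poles and interprets the quoted error as the standard deviation along each axis; both choices are assumptions of this example, as is the function name `simulate_detections`.

```python
import numpy as np

ARCSEC_PER_DEG = 3600.0

def simulate_detections(ra_deg, dec_deg, sigma_arcsec, rng):
    """Return perturbed (ra, dec) in degrees.

    sigma_arcsec may be a scalar (constant error, as assumed for SDSS)
    or an array with one value per object (variable error, as for GALEX).
    """
    ra_deg = np.asarray(ra_deg, dtype=float)
    dec_deg = np.asarray(dec_deg, dtype=float)
    sigma_deg = np.asarray(sigma_arcsec, dtype=float) / ARCSEC_PER_DEG

    # Independent Gaussian offsets towards east and north.
    d_east = rng.normal(size=ra_deg.shape) * sigma_deg
    d_north = rng.normal(size=ra_deg.shape) * sigma_deg

    # An eastward offset maps to a larger change in RA at high declination.
    ra_obs = ra_deg + d_east / np.cos(np.radians(dec_deg))
    dec_obs = dec_deg + d_north
    return ra_obs, dec_obs

rng = np.random.default_rng(1)
ra = rng.uniform(150.0, 160.0, 1000)
dec = rng.uniform(0.0, 10.0, 1000)

# Constant 0.1" error for the SDSS-like set.
sdss_ra, sdss_dec = simulate_detections(ra, dec, 0.1, rng)

# Per-object errors for the GALEX-like set, resampled from a
# placeholder array standing in for observed nuv_poserr values.
observed_poserr = np.array([0.45, 0.5, 0.6, 0.8, 1.2])  # hypothetical values
galex_sigma = rng.choice(observed_poserr, size=ra.size)
galex_ra, galex_dec = simulate_detections(ra, dec, galex_sigma, rng)
```

Resampling the errors from an empirical array mirrors the random assignment described above without assuming any functional form for their distribution.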

2.3. Selection function and confusion 

Our goal is to use the mock catalog described above as a predictive
tool in order to assess the quality of the cross-identifications
between two datasets, here GALEX and SDSS. Hence our mock catalog
has to present properties similar to those of the data. In practice
we need to include two effects: the selection functions of both
catalogs, in order to match the number density of the data, and the
confusion of detections caused by the combination of the seeing and
point spread functions.

Fig. 2. Distribution of astrometry errors for simulated detections.
The solid line shows errors on NUV detections for the selection of
all GALEX MIS objects, and the dotted line for the MIS objects with
S/N > 3. These distributions are normalized by their integrals.

To apply the selection function, we assign to each mock source a
random number u, drawn from a uniform distribution, which stands in
for the properties of the objects. We use the values of u to select
the simulated detections we further consider to study a given case
of cross-identification. The length of the interval in u sets the
density for a given mock catalog. Using the notation of Budavári &
Szalay (2008), we computed the number of SDSS DR7 sources,
$N_{SDSS}$, and of GALEX GR5 sources, $N_{GALEX}$, and scaled them
to the area of our mock catalog. These numbers set the interval in u
for both detection sets. We then use the overlap between the
intervals in u to set the density of common objects, as set by the
prior probability determined independently from the data (see
Section 3).
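
A minimal sketch of this interval-based selection follows. The placement of the two intervals (SDSS starting at zero, GALEX shifted so that the overlap has the requested length) is one convenient choice made for this example and is not specified in the text; the function names are likewise hypothetical.

```python
import numpy as np

def selection_intervals(n_sdss, n_galex, n_common, n_mock):
    """Return the (low, high) intervals in u for the SDSS and GALEX sets.

    Interval lengths set the densities; the overlap length sets the
    density of objects common to both detection sets.
    """
    f_s, f_g, f_c = n_sdss / n_mock, n_galex / n_mock, n_common / n_mock
    if f_c > min(f_s, f_g) or f_s + f_g - f_c > 1.0:
        raise ValueError("requested densities do not fit in [0, 1)")
    sdss = (0.0, f_s)
    galex = (f_s - f_c, f_s - f_c + f_g)  # overlaps SDSS on a length f_c
    return sdss, galex

def select(u, interval):
    """Boolean mask of the mock objects whose u falls in the interval."""
    low, high = interval
    return (u >= low) & (u < high)

rng = np.random.default_rng(7)
u = rng.uniform(size=100_000)
sdss_int, galex_int = selection_intervals(50_000, 20_000, 15_000, u.size)
in_sdss = select(u, sdss_int)
in_galex = select(u, galex_int)
print(in_sdss.sum(), in_galex.sum(), (in_sdss & in_galex).sum())
```

Changing `n_common` alone shifts the GALEX interval and therefore the overlap, which is the knob used to match the observed prior.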

To simulate the confusion of the detections, we performed the
cross-identification of the SDSS and GALEX detection sets with
themselves, using search radii of 1.5" and 5" respectively. These
values of the search radius correspond to the effective widths of
the PSF in both surveys (Stoughton et al. 2002; Morrissey et al.
2007). We then consider only the detections that satisfy the
selection function criterion, and merge them. For SDSS, we keep one
source chosen randomly from the various identifications. For GALEX,
we keep the source with the largest position error.
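
The merging step can be sketched as follows. The grouping rule used here (friends-of-friends linking of all detections closer than the search radius) is an assumption of this example, since the text does not state how chains of neighbours are handled; positions are taken as flat-sky offsets in arcseconds, and the pair search is a simple quadratic loop suited only to small demonstrations.

```python
import numpy as np

def merge_confused(x_arcsec, y_arcsec, radius_arcsec, keep_score):
    """Group detections closer than radius_arcsec and keep one per group.

    The survivor in each group is the detection with the largest
    keep_score: pass random numbers for a random choice (SDSS-like),
    or the position errors to keep the largest error (GALEX-like).
    Returns the sorted indices of the surviving detections.
    """
    x = np.asarray(x_arcsec, dtype=float)
    y = np.asarray(y_arcsec, dtype=float)
    score = np.asarray(keep_score, dtype=float)
    n = x.size
    parent = list(range(n))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    r2 = radius_arcsec ** 2
    for i in range(n - 1):
        d2 = (x[i + 1:] - x[i]) ** 2 + (y[i + 1:] - y[i]) ** 2
        for j in np.nonzero(d2 < r2)[0] + i + 1:
            ri, rj = find(i), find(int(j))
            if ri != rj:
                parent[rj] = ri

    best = {}
    for i in range(n):
        root = find(i)
        if root not in best or score[i] > score[best[root]]:
            best[root] = i
    return np.array(sorted(best.values()), dtype=int)

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 10.0, 10.5, 30.0])
y = np.zeros(5)
print(merge_confused(x, y, 1.5, rng.random(5)))           # random survivor
print(merge_confused(x, y, 5.0, [0.5, 0.6, 0.9, 0.4, 0.1]))  # largest error
```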

This procedure is repeated for each cross-identification 
we consider, as modifying the selection function naturally 
implies a change in the number densities and priors. 

3. CROSS IDENTIFICATION 

We performed the cross-identification between the SDSS and GALEX
detection sets using a 5" radius. For each association (see
Budavári & Szalay 2008), we compute the Bayes factor

$$ B(\psi; \sigma_S, \sigma_G) = \frac{2}{\sigma_S^2 + \sigma_G^2} \exp\left[ -\frac{\psi^2}{2(\sigma_S^2 + \sigma_G^2)} \right] \tag{1} $$

where $\psi$ is the angular separation between the two detections,
expressed here in radians, as are the position errors $\sigma_S$ and
$\sigma_G$. We also derive the posterior probability that the two
detections are from the same source

$$ P = \left[ 1 + \frac{1 - P_0}{B P_0} \right]^{-1} \approx \frac{B P_0}{1 + B P_0} \tag{2} $$

where $P_0$ is the prior probability, and the approximation is for
the usually small priors.

1 See also http://www.sdss.org/DR7/products/general/seeing.html
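
The two quantities can be evaluated with a few lines of code. The sketch below uses the two-catalog Bayes factor of Budavári & Szalay (2008) in its small-angle form and the exact posterior (before the small-prior approximation); the function names and the choice of arcseconds as input units are assumptions of this example.

```python
import numpy as np

ARCSEC_TO_RAD = np.pi / (180.0 * 3600.0)

def bayes_factor(psi_arcsec, sigma_s_arcsec, sigma_g_arcsec):
    """Bayes factor for a pair separated by psi, with Gaussian errors.

    All inputs are in arcseconds and converted to radians internally,
    since the Bayes factor is defined with angles in radians.
    """
    psi = np.asarray(psi_arcsec, dtype=float) * ARCSEC_TO_RAD
    var = (np.asarray(sigma_s_arcsec, dtype=float) ** 2
           + np.asarray(sigma_g_arcsec, dtype=float) ** 2) * ARCSEC_TO_RAD ** 2
    return 2.0 / var * np.exp(-psi ** 2 / (2.0 * var))

def posterior(bayes, prior):
    """Posterior probability of a match given Bayes factor and prior.

    Written as B*P0 / (B*P0 + 1 - P0), which is algebraically the same
    as 1 / (1 + (1 - P0) / (B*P0)) but safe when B underflows to zero.
    """
    bp = np.asarray(bayes, dtype=float) * prior
    return bp / (bp + 1.0 - prior)

# Example: constant SDSS error, two different GALEX errors, same separation.
B = bayes_factor(1.0, 0.1, np.array([0.5, 1.0]))
print(posterior(B, 1e-10))  # the prior value here is purely illustrative
```

With a constant prior and identical errors for all objects, `posterior` is a monotonic function of the separation alone, which is the degeneracy discussed in the next paragraph.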

The Bayes factor, and hence the posterior probability, depend on the
position errors from both surveys. As we use a constant prior $P_0$,
this implies that if all objects have the same position errors
within a survey, the posterior probability depends on the angular
separation only. In this case, there is no difference between using
a criterion based upon separation or probability.

We use the posterior probability rather than the Bayes factor as a
criterion. Under the assumption of a constant prior probability, the
posterior probability is a monotonic function of the Bayes factor.
However, while we take the prior to be constant in this case study,
in practice it may vary over the sky. Note also that two surveys
with similar position error distributions can, for instance, have
different priors, and then a criterion defined on the basis of the
Bayes factor for one survey cannot be applied directly to the other
one.

In order to set the overlap between our two detection sets as
described in Section 2.3 to match the selection functions of the
actual datasets, we need to compute the prior $P_0$ from the data,
using the actual cross-identification between GALEX GR5 and SDSS
DR7.

The prior is given by

$$ P_0 = \frac{N_\star}{N_{SDSS} N_{GALEX}} \tag{3} $$

where $N_\star$ is the number of sources in the overlap between the
various selections (angular, radial, etc.) of the catalogs
considered for the cross-identification, i.e. the number of sources
in the resulting catalog. We use the self-consistency argument
discussed by Budavári & Szalay (2008),

$$ \sum P = N_\star \tag{4} $$

to derive $P_0$. To choose the value of the prior, we use the
iterative process described in Budavári & Szalay (2008). We start
the process by setting $N_\star = \min(N_{SDSS}, N_{GALEX})$. We
then compute the sum of the posterior probabilities derived using
eq. (2). According to eq. (4), this sum gives us a new value for
$N_\star$. The same procedure is then repeated using this updated
value, yielding an updated value of the prior as well. The chosen
value for the prior is obtained after convergence; we hereafter call
this value the observed prior.
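
The iteration can be sketched in a few lines. The example below assumes that the Bayes factors of all candidate pairs are already available in an array and that the source counts are expressed on the footing required by the prior formula; the convergence tolerance, the iteration cap, and the function name are choices made for this illustration.

```python
import numpy as np

def self_consistent_prior(bayes, n_sdss, n_galex, tol=1e-6, max_iter=100):
    """Iterate prior -> posteriors -> N_star until N_star stops changing.

    bayes : Bayes factors of all candidate associations
    Returns the converged prior and the list of priors tried.
    """
    bayes = np.asarray(bayes, dtype=float)
    n_star = float(min(n_sdss, n_galex))  # starting guess
    history = []
    for _ in range(max_iter):
        prior = n_star / (n_sdss * n_galex)
        history.append(prior)
        bp = bayes * prior
        new_n_star = np.sum(bp / (bp + 1.0 - prior))  # sum of posteriors
        converged = abs(new_n_star - n_star) <= tol * n_star
        n_star = new_n_star
        if converged:
            break
    return n_star / (n_sdss * n_galex), history

# Toy input: 500 strong candidates and 2000 weak ones (made-up numbers).
toy_bayes = np.concatenate([np.full(500, 1e8), np.full(2000, 10.0)])
prior, history = self_consistent_prior(toy_bayes, 1000, 800)
print(prior, len(history))
```

In this toy case the number of matches settles slightly above the 500 strong candidates, because the weak candidates each contribute a small posterior probability to the sum.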

Then we set the overlap between the two detection sets in our mock
catalog such that the prior value derived for the
cross-identification in the simulations matches the observed one. We
use the same iterative process as described above to determine $P_0$
in the simulations. Figure 3 shows this iteration process starting
from $N_\star = N_{GALEX}$ for the case with all MIS objects (filled
circles) or MIS S/N > 3 objects (open circles). The procedure
converges quickly in terms of the number of steps. Note also that
the query we use to compute the sum runs in roughly 1 second on
these simulations.

The benefit of using simulations is that, once we set the overlap
between the detection sets required to match the observed prior, we
know the input value of $N_\star$ (i.e. the actual number of
detections in the overlap between the two sets) and hence we can
derive the prior corresponding to this number directly using eq.
(3), which we call the true prior. We show this true prior in Fig. 3
as a solid line for the case of all MIS objects, and as a dashed
line for MIS objects with S/N > 3. The true priors we are required
to use in order to match the data are slightly lower than the
observed ones for both selections: 4% lower for all MIS objects and
2.5% lower for MIS objects with S/N > 3. In other words, we need to
use fewer objects in the overlap between our detection sets than
what we expect from the data.

A different prior value implies a change in the posterior
probability; however the latter also depends on the value of the
Bayes factor B. Given the scaling of the relation between the
posterior and prior probabilities (eq. 2), for low B values
(B ≪ 1), a variation of 4% in the prior yields a variation in
posterior probability of the same amount. For high B values
(B ≫ 1), the variation is about 0.5%. Hence this difference between
the true and observed priors has a negligible impact on the values
of the posterior probabilities derived afterwards.
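
This scaling can be made explicit with a short derivation, given here as a sketch that assumes the small-prior form of the posterior, $P \approx B P_0 / (1 + B P_0)$, as the starting point. Taking the logarithmic derivative with respect to the prior:

```latex
\ln P = \ln B + \ln P_0 - \ln(1 + B P_0)
\quad\Longrightarrow\quad
\frac{\partial \ln P}{\partial \ln P_0}
  = 1 - \frac{B P_0}{1 + B P_0}
  = \frac{1}{1 + B P_0}
  = 1 - P ,
\qquad\text{so}\qquad
\frac{\delta P}{P} \simeq (1 - P)\,\frac{\delta P_0}{P_0} .
```

Under this assumption the relevant quantity is the product $B P_0$ rather than B alone. When $B P_0 \ll 1$ (so $P \to 0$), a 4% change in the prior gives a 4% change in the posterior. A 0.5% change corresponds to $1 - P = 0.005 / 0.04 = 0.125$, i.e. associations with $P \approx 0.875$, and the relative change keeps shrinking as $P \to 1$.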
Fig. 3. Self-consistent estimation of the prior probability as a
function of iteration step. Filled circles show the iteration for
the case of all MIS objects, and open circles for MIS objects with
S/N > 3. The solid (dashed) line shows the true prior for all MIS
objects (MIS objects with S/N > 3).

To quantify the quality of the cross-identification, we define the
true positive rate, T, and the false positive contamination, F. We
can express these quantities as a function of the angular separation
of the association, or the posterior probability. Let n(x) be the
number of associations, where x denotes separation or probability.
This number is the sum of the true and false positive
cross-identifications: $n(x) = n_T(x) + n_F(x)$. We define the true
positive rate and false positive contamination as a function of
angular separation $\psi$ as

$$ T(\psi) = \frac{\sum n_T(x < \psi)}{N_T} \tag{5} $$

$$ F(\psi) = \frac{\sum n_F(x < \psi)}{\sum n(x < \psi)} \tag{6} $$

where $N_T$ is the total number of true associations. Similar rates
are defined as a function of the probability,

$$ T(P) = \frac{\sum n_T(x > P)}{N_T} \tag{7} $$

$$ F(P) = \frac{\sum n_F(x > P)}{\sum n(x > P)} \tag{8} $$

Fig. 4. True positive rate (blue, solid lines) and false positive
contamination (red, dashed lines) as a function of angular
separation (left) and posterior probability (right). GALEX position
errors from the full MIS sample yield the thick curves; the S/N > 3
constraint yields the thin curves. We also show the posterior
probability thresholds defined as in Budavári & Szalay (2008)
(vertical lines on the right-hand plot).



We use the detection merging process to qualify the
cross-identifications as true or false. In our final mock catalog, a
detection represents a set of detections that have been merged. We
therefore count a cross-identification as true when the two sets of
merged detections have at least one detection in common.
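
The two rates are simple cumulative counts, as the following sketch shows. It assumes that each candidate association carries a truth flag (available only in simulations) and that the total number of true associations is known; the function name and the `keep` switch are conveniences of this example.

```python
import numpy as np

def rates(x, is_true, n_true_total, thresholds, keep="below"):
    """True positive rate T and false positive contamination F.

    x            : separation or posterior probability of each association
    is_true      : boolean truth flag of each association
    n_true_total : total number of true associations (N_T)
    keep         : "below" selects x < cut (separation),
                   "above" selects x > cut (probability)
    """
    x = np.asarray(x, dtype=float)
    is_true = np.asarray(is_true, dtype=bool)
    T = np.empty(len(thresholds))
    F = np.empty(len(thresholds))
    for k, cut in enumerate(thresholds):
        sel = x < cut if keep == "below" else x > cut
        n_sel = np.count_nonzero(sel)
        n_t = np.count_nonzero(sel & is_true)
        T[k] = n_t / n_true_total
        # With an empty selection there is nothing to contaminate.
        F[k] = (n_sel - n_t) / n_sel if n_sel > 0 else 0.0
    return T, F

sep = [0.5, 1.0, 2.0, 4.0]              # arcsec, toy values
truth = [True, True, False, True]
T, F = rates(sep, truth, 3, [1.5, 5.0])
print(T, F)
```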

Figure 4 shows the true positive rate and the false positive
contamination as a function of angular separation (left) and
posterior probability (right). These results suggest that in the
case of the SDSS-GALEX MIS cross-identification, a search radius of
5" is required in order to recover all the true associations. In the
case of all MIS objects, 90% of the true matches are recovered at
1.64" with a 2.6% contamination from false positives. As expected,
results are better using objects with high signal-to-noise ratio
(S/N > 3), where 90% of the true matches are recovered at 1.15" with
a 1% contamination. Turning to the posterior probability, the trends
are similar to the ones observed as a function of separation.
However, the false positive contamination increases less rapidly
with probability. For instance, a cut at P > 0.89 recovers 90% of
the true associations, with a slightly lower contamination from
false positives (2.3%).



Fig. 5. Cross-identification diagnostic plot: 1 - T (one minus the
true positive rate) versus the false positive contamination. These
quantities are computed as a function of probability (blue, solid
lines) or separation (red, dashed lines). Thick lines show the
results for all GALEX MIS objects, and thin lines for GALEX MIS
objects with S/N > 3. The dotted line represents the locus of
1 - T = F.

We examine in detail the benefits of using separation or probability
as a criterion in Section 4.1.

4. RESULTS 
4.1. Performance analysis 

Using the quantities defined above, we can build a diagnostic plot
in order to assess the overall quality of the cross-identification,
and define a criterion to select the objects to use in practice for
further analyses. We show in Fig. 5 the quantity 1 - T (one minus
the true positive rate) against the false positive contamination,
computed as a function of probability or angular separation. We can
compare the false positive contamination that corresponds to a given
true positive rate threshold for each of these parameters.

The results show that there are some differences between criteria
based on angular separation or posterior probability. Considering
all GALEX MIS objects (thick lines in Fig. 5), for 1 - T > 0.18, the
false contamination rate is slightly lower when using angular
separation as a criterion. This range of true positive rates
corresponds to angular separations smaller than 1.2". As there is a
lower limit to the GALEX position errors, this translates into an
upper limit in terms of posterior probability at a given angular
separation. This in turn implies that the probability criterion does
not appear as efficient as separation for associations at small
angular distances in the SDSS-GALEX case.

At 1 - T < 0.18, this trend reverses: a criterion based on
probability yields a lower false contamination rate.

We can characterize these diagnostic curves by the Bayes threshold,
where 1 - T = F, which minimizes the Bayes error. The location of
this threshold is represented in Fig. 5 by the intersection between
the diagnostic curves and the dotted line. Our results show that
this intersection happens at a lower false positive contamination
rate when using the posterior probability as the criterion.

Fig. 6. True positive rate (blue, thick lines) and false
contamination rate (red, thin lines) as a function of probability
for the one GALEX to one SDSS (solid lines), one GALEX to two SDSS
(dashed lines), and one GALEX to many SDSS (dotted lines)
associations. The left panel (SDSS x GALEX MIS) shows these rates
for all GALEX MIS objects, and the right panel (SDSS x GALEX MIS
S/N > 3) for the GALEX MIS objects with S/N > 3. Note that the
curves representing the one GALEX to many SDSS associations can
barely be seen as the values are too small.

Fig. 7. 1 - T (one minus the true positive rate) as a function of
the false contamination rate for the one GALEX to one SDSS (thick
lines) and one GALEX to two SDSS (thin lines) associations. The
rates are computed as a function of probability (blue, solid lines)
or separation (red, dashed lines). The left panel shows these rates
for all GALEX MIS objects, and the right one for the GALEX MIS
objects with S/N > 3.

For all GALEX objects, the separation where 1 - T = F, $\psi_c$, is
equal to 2.307" and the probability, $P_c$, is 0.613. Using the
angular separation as the criterion, the Bayes error is then
$P_e$ = 0.102; using the posterior probability, $P_e$ = 0.091. For
GALEX objects with S/N > 3, $\psi_c$ = 1.882" and $P_c$ = 0.665;
$P_e$ = 0.055 using the angular separation and $P_e$ = 0.049 using
the posterior probability.

These results show that a selection based on posterior 
probability yields better results (i.e., a lower false con- 
tamination rate, and lower Bayes error) than a selection 
based on angular separation. 
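
Locating the crossing 1 - T = F on a set of simulated associations can be sketched as below. The approach (sort the associations from best to worst, accumulate counts, and pick the cut where the two rates are closest) is one straightforward way to do it and is an assumption of this example; how the quoted Bayes error is computed from the two rates is not spelled out in the text, so the function simply returns both.

```python
import numpy as np

def bayes_threshold(score, is_true, n_true_total, larger_is_better):
    """Find the cut where 1 - T is closest to F.

    score            : probability (larger_is_better=True) or
                       separation (larger_is_better=False)
    Returns (cut value, 1 - T at the cut, F at the cut); the cut keeps
    every association at least as good as the returned score.
    """
    score = np.asarray(score, dtype=float)
    is_true = np.asarray(is_true, dtype=bool)

    # Order associations from best to worst.
    order = np.argsort(-score if larger_is_better else score)
    n_sel = np.arange(1, score.size + 1)   # number kept at each cut
    n_t = np.cumsum(is_true[order])        # true associations kept

    miss = 1.0 - n_t / n_true_total        # 1 - T
    contam = (n_sel - n_t) / n_sel         # F
    k = np.argmin(np.abs(miss - contam))
    return score[order][k], miss[k], contam[k]

prob = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]      # toy values
truth = [True, True, False, True, False, False]
print(bayes_threshold(prob, truth, 3, larger_is_better=True))
```

Running the same function on separations and on probabilities for one set of associations gives the two crossing points that are compared above.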

4.2. Associations 

Beyond the confused objects, the cross-identification list contains
several types of associations, where a single detection in one
catalog is linked to possibly more than one detection in the other.
We list in Table 1 the contingency table of the percentages of these
types in the mock catalog and, in brackets, for the SDSS DR7 to
GALEX GR5 cross-identifications.

TABLE 1
Percentages of associations by type

                              SDSS
GALEX     1                  2                  Many
1         74.061 (75.870)    21.007 (18.595)    2.577 (2.469)
2          1.146 (2.253)      1.006 (0.697)     0.188 (0.102)
Many       0.006 (0.009)      0.007 (0.004)     0.002 (0.001)

Note. Percentages of associations by type in the mock catalogs. The
numbers in brackets give the percentages from the
cross-identification of SDSS DR7 and GALEX GR5 data. All percentages
are given with respect to the total number of matches.

The main contribution is from the one GALEX to one SDSS
associations (1G1S, 74%), but the most significant of the other
cases are one GALEX to two SDSS (1G2S, 21%) and one GALEX to many
SDSS (1GmS, 3%). Comparing with the data, our mock catalogs are
slightly pessimistic in the sense that the fraction of one-to-one
matches is lower than in the observations. However, these fractions
match reasonably well, which enables us to discuss these cases in
the context of our mock catalogs. We show in Figure 6 the true
positive and false contamination rates as a function of probability
for the 1G1S (solid lines), 1G2S (dashed lines), and 1GmS (dotted
lines) associations. The 1G1S true associations represent the bulk
(up to 85%) of the total cross-identifications. There is also a
significant fraction of true associations within the 1G2S cases (up
to nearly 13%), while the 1GmS are around 1%. For the 1G2S or 1GmS
cases, we use two methods to select one object among the various
associations: the one corresponding to the highest probability or
the one with the smallest separation. We computed the true positive
and false contamination rates for these cases as a function of the
quantity used for the selection of the association. We compare the
results from these two methods in Figure 7, which shows the
diagnostic curves for the 1G1S (thick lines) and 1G2S (thin lines)
cases; we do not show the 1GmS cases here as they represent only 1%
of the associations. The diagnostic curves present the same trend as
the global ones (see Fig. 5): the posterior probability criterion
yields a lower false contamination rate than the angular separation
criterion above some true positive rate value (e.g., 1 - T < 0.29
for 1G1S associations considering the cross-identification of all
SDSS-GALEX objects).

This is, however, an artifact caused by the distribution of the
GALEX position errors (see Section 4.1). For the 1G2S or 1GmS cases,
these results show that true associations can be recovered by
selecting the maximal probability, with a low contamination from
false positives (up to around 1%).

In Figs. 6 and 7 we compare the results from all GALEX MIS objects
and GALEX MIS objects with S/N > 3. The quality of the
cross-identifications is better for the latter, for all types of
associations.
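
Choosing one SDSS candidate per GALEX detection, by highest probability or by smallest separation, can be sketched as follows. The flat arrays of candidate pairs and the function name are assumptions of this example; ties are broken by the sort order.

```python
import numpy as np

def best_per_galex(galex_id, value, pick="max"):
    """Return indices of the chosen candidate for each GALEX detection.

    galex_id : GALEX identifier of each candidate pair
    value    : posterior probability (pick="max") or
               angular separation (pick="min") of each pair
    """
    galex_id = np.asarray(galex_id)
    value = np.asarray(value, dtype=float)
    key = -value if pick == "max" else value

    # Sort by GALEX id first, then by the ranking key within each id.
    order = np.lexsort((key, galex_id))
    ids_sorted = galex_id[order]

    # The first row of each id block is the best candidate.
    first = np.ones(order.size, dtype=bool)
    first[1:] = ids_sorted[1:] != ids_sorted[:-1]
    return np.sort(order[first])

gid = [1, 1, 2, 3, 3, 3]
prob = [0.2, 0.9, 0.5, 0.1, 0.3, 0.25]
sep = [3.0, 0.5, 1.0, 2.0, 4.0, 0.3]
print(best_per_galex(gid, prob, pick="max"))
print(best_per_galex(gid, sep, pick="min"))
```

The two calls can return different candidates for the same GALEX detection when position errors vary, which is what makes the comparison of the two selection methods meaningful.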

4.3. Alternative Error model 
Fig. 8. GALEX position error as a function of NUV magnitude. Circles
show the GALEX pipeline errors, squares the alternative errors (see
text). The solid line shows the linear model we use to modify the
GALEX pipeline errors.

The accuracy of the analysis of the quality of the
cross-identification strongly depends on the GALEX pipeline position
errors. We use the real data, namely the angular separation to the
SDSS sources measured during the cross-identification between GALEX
GR5 and SDSS DR7, to get an alternative estimate of realistic
errors. In principle the distribution of the angular separations of
the associations depends on the combination of the GALEX and SDSS
position errors. However, the latter are significantly smaller than
the former, so we consider the SDSS errors as negligible here. We
compare in Figure 8 the dependence on the NUV magnitude of the
position error in the NUV band from the GALEX pipeline (circles) and
of the distance to the SDSS sources (squares), considering only
objects classified as point sources in SDSS. The angular separations
between the sources of the two surveys are significantly larger than
the quoted GALEX pipeline errors. These latter errors are a
combination of a constant field error (equal in NUV to 0.42") and a
Poisson term. In the range where both error estimates are constant
(18 < NUV < 20), this comparison suggests that the GALEX field error
might be slightly underestimated. For fainter objects, our
alternative errors increase faster with magnitude than the GALEX
pipeline errors, which might indicate that this dependence is not
well reproduced by the Poisson term.

We fitted a linear relation to modify the GALEX errors in order to
match the angular separations to the SDSS sources,

$$ \mathrm{NUV}_{\rm poserr}^{\rm alt} = 2.2\,\mathrm{NUV}_{\rm poserr} - 0.3 \tag{9} $$

where the position errors are in units of arcsec. This error model
is shown as a solid line in Figure 8. It reproduces well the
alternative errors for NUV < 22.5, which is similar to the
5$\sigma$ limiting magnitude for the MIS in the NUV band (22.7;
Morrissey et al. 2007).

We followed the same steps as described in Sections 2.2 and 2.3 with
these new errors and performed the cross-identification. The
diagnostic curves we obtain are presented in Fig. 9.

Fig. 9. Same as Figure 5 using alternative position errors for GALEX
sources (see text).
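
Applied to a list of pipeline errors, the linear rescaling quoted above (slope 2.2, offset 0.3, both in arcsec) is a one-line transformation. The function name below is hypothetical, and the check on the magnitude range is an illustrative addition based on the stated validity limit NUV < 22.5.

```python
import numpy as np

def alternative_nuv_poserr(nuv_poserr_arcsec):
    """Rescale GALEX pipeline NUV position errors (arcsec) linearly."""
    return 2.2 * np.asarray(nuv_poserr_arcsec, dtype=float) - 0.3

def model_is_calibrated(nuv_mag, limit=22.5):
    """Flag objects in the magnitude range where the fit matches the data."""
    return np.asarray(nuv_mag, dtype=float) < limit

pipeline_err = np.array([0.42, 0.50, 0.80])   # toy pipeline errors, arcsec
nuv_mag = np.array([19.0, 21.5, 23.0])        # toy magnitudes
print(alternative_nuv_poserr(pipeline_err))
print(model_is_calibrated(nuv_mag))
```

At the constant field error of 0.42" the rescaled error is about 0.62", consistent with the suggestion that the field error is underestimated.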

The trends are similar to those observed using the GALEX pipeline
errors. The quality of the cross-identification is nevertheless
worse with the alternative errors. In this case, the values of
angular separation and probability where 1 - T = F are
$\psi_c$ = 3.126" and $P_c$ = 0.711 for all GALEX objects. Using the
angular separation as a criterion, $P_e$ = 0.144 (0.102 with the
GALEX pipeline errors), and $P_e$ = 0.127 (0.091 with pipeline
errors) with the posterior probability. For GALEX objects with
S/N > 3, $\psi_c$ = 2.514" and $P_c$ = 0.780; $P_e$ = 0.0958 (0.055
with pipeline errors) using angular separation, and $P_e$ = 0.0812
(0.049 with pipeline errors) with the posterior probability.

In other words, the contamination from false positives is larger at
a given true positive rate. For instance, for all GALEX MIS objects,
with 90% of the true associations and considering posterior
probability as a criterion, the contamination is 5% compared to 2.3%
using the GALEX pipeline errors. Note also that the difference
between the angular separation and the probability diagnostic curves
is larger with this alternative error model. This suggests that the
probability is a more efficient way than angular separation to
select cross-identifications for surveys with larger position
errors.

4.4. Building a GALEX-SDSS catalog 

The combination of the results we presented can be used to define a
set of criteria for constructing a reliable joint GALEX-SDSS
catalog. It is natural to have different selections for each type of
association. We focus here on the 1G1S and 1G2S cases, as they
represent around 95% of the associations.

In Table 2 we propose a set of criteria, based on the posterior
probability, to get 90% of the true cross-identifications,
consisting of 80% of 1G1S and 10% of 1G2S. We also list the
corresponding false positive contamination. These cuts make it
possible to build catalogs with 1.8% of false positives when using
all GALEX objects, or 0.8% when using GALEX objects with S/N > 3.
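
Applying the probability cuts of Table 2 to a list of associations can be sketched as below. The thresholds are the ones listed in Table 2; the labels `"1G1S"`, `"1G2S"`, `"all"`, and `"sn3"`, the array layout, and the function name are conventions invented for this example. For 1G2S cases the input is assumed to hold only the candidate already chosen by maximal probability.

```python
import numpy as np

# Posterior probability cuts from Table 2, keyed by (sample, association type).
CUTS = {
    ("all", "1G1S"): 0.877,
    ("all", "1G2S"): 0.955,
    ("sn3", "1G1S"): 0.939,
    ("sn3", "1G2S"): 0.982,
}

def select_catalog(prob, assoc_type, sample="all"):
    """Boolean mask of the associations kept in the joint catalog.

    prob       : posterior probability of each association
    assoc_type : "1G1S", "1G2S", or anything else (always rejected)
    sample     : "all" for all GALEX MIS objects, "sn3" for S/N > 3
    """
    prob = np.asarray(prob, dtype=float)
    assoc_type = np.asarray(assoc_type)
    keep = np.zeros(prob.shape, dtype=bool)
    for kind in ("1G1S", "1G2S"):
        keep |= (assoc_type == kind) & (prob > CUTS[(sample, kind)])
    return keep

prob = [0.90, 0.90, 0.96, 0.99]
kinds = ["1G1S", "1G2S", "1G2S", "1GmS"]
print(select_catalog(prob, kinds, sample="all"))
print(select_catalog(prob, kinds, sample="sn3"))
```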

TABLE 2
Selection criteria for SDSS-GALEX sample

Association                     Probability cut   False positive contamination (%)
1 GALEX to 1 SDSS               P > 0.877         1.6
1 GALEX to 2 SDSS               P > 0.955         0.2
1 GALEX (S/N > 3) to 1 SDSS     P > 0.939         0.7
1 GALEX (S/N > 3) to 2 SDSS     P > 0.982         0.1

Note. Posterior probability cuts to obtain 80% (10%) of the true
associations for the one GALEX to one SDSS (one GALEX to two SDSS)
matches. The corresponding false positive contamination percentages
are also listed. The first two lines give the cuts for all GALEX MIS
objects and the last two for the GALEX MIS objects with S/N > 3.

5. CONCLUSIONS

We presented a general method using simple mock catalogs to assess
the quality of the cross-identification between two surveys, which
takes into account the angular distribution and confusion of
sources, and the respective selection functions of the surveys. We
applied this method to the cross-identification of the SDSS and
GALEX sources. We used the probabilistic formalism of Budavári &
Szalay (2008) to study how the quality of the associations can be
quantified by the posterior probability. Our results show that
criteria based on posterior probability yield lower contamination
rates from false positives than criteria based on angular
separation. In particular, the posterior probability is more
efficient than angular separation for surveys with larger position
errors. Our study also suggests that the GALEX pipeline position
errors might be underestimated, and we described an alternative
measure of these errors. We finally proposed a set of selection
criteria based on posterior probability to build reliable SDSS-GALEX
catalogs that yield 90% of the true associations with less than 2%
contamination from false positives.